Analyzing Past Awards using R

A Beginners’ Guide with As Little Code-tweaking as Possible

Author
Affiliation

JMU REDI Office of Research Development

Published

March 20, 2025

Introduction

How can we analyze and present past award data with no to minimal knowledge of R?

Well, that is difficult, especially when you come across problems. But it is not impossible. I will give as gentle an introduction to analyzing past awards in R as I can.

You will be able to rely mostly on the code here, and I will explain what you need to edit. Where stuck, Github Copilot, which is free for those with university affiliations, or generative AI tools can come to your help. However, you would need to explain your error clearly and step-by-step to get the best results.

Installing R and RStudio

First, you need to install R and RStudio. R is the programming language, and RStudio is an integrated development environment (IDE) for R. You can download R from here, and RStudio from here.

You can refer to this video for a step-by-step guide on how to install R and RStudio.

Downloading the Repository

After installing R and RStudio, open RStudio, and click on the “File” menu, then “New Project,” “Version Control,” and then “Git.”

Under the “Repository URL” box, paste the URL of the repository: https://github.com/kjayhan/past_grants.git. Then give this project directory a name, and select a directory folder to save it in, then click “Create Project.”

This repository contains the Quarto document, the R code, sample awards data, and bonus .scss file that you need to analyze past awards.

Important

When you are analyzing your past award data, make sure to put it in the data folder as a “.csv” file within the same directory folder you just created by downloading my repository.

Installing Packages

After downloading the repository, you need to install the packages that we will be using. You will do this only once, just like installing an app on your phone. You can do this by running the following code in the R console. If a code in R begins with “#”, it is a comment, and it does not run. If you uncomment it by removing the “#”, it will run. For the following code, you can uncomment the following lines in the code chunk to install the packages.

Show the code
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("stopwords")
# install.packages("topicmodels")
# install.packages("tm")
# install.packages("ggraph")
# install.packages("igraph")
# install.packages("janitor")
# install.packages("ggeasy")

Then we load the packages we will be using. This is similar to opening an app on your phone to be able to use it. You can do this by running the following code in the R console.

Show the code
library(tidyverse) # Data manipulation and visualization
library(janitor) # Data cleaning tools
library(tidytext) # Text mining and processing
library(stopwords) # Access to stopword lists
library(topicmodels) # For topic modeling
library(tm) # Text mining utilities
library(ggraph) # For creating graph-based visualizations
library(igraph) # Graph analysis
library(ggeasy) # Easy labeling of ggplot2 graphs

Loading the Data

Replace the name of the file with the name of your file name.

The sample data I use is from NSF Division of Social and Economic Sciences (SBE/SES).

Show the code
awards <- read_csv("data/awards.csv", locale = locale(encoding = "Latin1"))

Basic Data Cleaning

Important

If the column name for the amount or date are different, replace all instances of “awarded_amount_to_date” with the correct column name for amount, and the same for “last_amendment_date.” You can do that by using the find and replace feature by clicking CTRL+F or Command+F.

We will begin with cleaning the column names and the data to make them more readable and usable.

Show the code
awards <- janitor::clean_names(awards)

The date format is set to “%m/%d/%Y” (e.g., 05/15/2024) in the code below. If the date format in your csv file is different, you can change it accordingly. See this page.

For now I will limit this analysis to projects that begin with “Collaborative Research” in their titles, to show you how to filter the data. You can change this to whatever you want.

Important

Remove or uncomment the following line if you don’t want to filter anything based on title.

If you want to filter based on title, you can change the title to whatever you want. You can also use regex to filter based on a pattern. For example, if you want to filter for titles that contain “Collaborative Research” OR “Collaborative” OR “Research,” you can use str_detect(title, "Collaborative|Research") instead. For more on regex, see this page.

Show the code
awards <- awards |>
  filter(str_detect(title, "Collaborative Research"))

Now, we make sure that the amount is numeric and the date is in the correct format.

Important

If the amount is already numeric, comment the following line.

Show the code
awards$awarded_amount_to_date <- parse_number(awards$awarded_amount_to_date)

The date format is set to “%m/%d/%Y” (e.g., 05/15/2024) in the code below. If your date format in the csv file is different, you can change it accordingly. See this page.

Show the code
awards$last_amendment_date <- parse_date(awards$last_amendment_date, format = "%m/%d/%Y")

You can filter for the years you are interested in. Replace 2016 with the year you are interested in. Greater than 2016 (i.e. > 2016) will give you the years after 2016, not including 2016. For more on data transformation, see this page.

Show the code
awards <- awards |>
  filter(year(last_amendment_date) > 2016)

We will now replace HTML tags from the abstract column to keep only the text. I would also like to remove the following sentence, because it conflates text analysis, as it is in the end of every abstract: “This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.”

Show the code
awards$abstract <- gsub("</>|<br/>|\r\r|\r \r|This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.", " ", awards$abstract)

Scripting in Quarto

You can learn more about Quarto here.

You can write in Quarto using RStudio. You can write in plain text, and it will be converted to HTML or PDF. Headings begin with “#”, and subheadings with “##”, “###” depending on what level of subheading you want.

Introduction

You can write your introduction here. This section is not code. You can write your introduction in plain text.

Awarded Amounts

In this section, I give an analysis of the awarded amounts.

Some awards are collaborative projects. In other words, the same project (with the same title and abstract) has more than one entry, because the amount is divided across different institutions. The following code aggregates the amount for collaborative projects.

Show the code
awards_g <- awards |>
  group_by(title) |>
  summarize(n = n(),
            amount = sum(awarded_amount_to_date, na.rm = T)) |>
  ungroup()
Important

I give you two options for a summary of the awarded amounts in the template Make sure to remove one of them.

Awarded Amounts Option 1

If the awarded amount range is “normal,” you can use the following code. Otherwise, delete the following code chunk. If you want to make changes to the code (color, font size, axis breaks etc.), your best bet is getting help from the internet or a generative AI tool. If you want to learn more about ggplot2, you can check out this page.

This code creates a density plot of the awarded amounts. You can change the color of the lines by changing the hex color codes. You can also change the x and y axis labels by changing the text in the labs function.

Show the code
density <- awards_g |>
  ggplot(aes(round(amount/1e3))) +
  geom_density() +
  geom_vline(xintercept = median(round(awards_g$amount/1e3)), color = "#450084") +
  geom_vline(xintercept = mean(round(awards_g$amount/1e3)), color = "#CBB677") +
  geom_text(aes(x = median(round(amount/1e3)), y = Inf, label = "Median"), color = "#450084", vjust=1,hjust=2.1,size=3,angle=90) +
  geom_text(aes(x = mean(round(amount/1e3)), y = Inf, label = "Mean"), color = "#CBB677", vjust=1,hjust=2.1,size=3,angle=90) +
  scale_x_continuous(breaks = c(seq(0, max(round(awards_g$amount/1e3)), by = 1000), max(round(awards_g$amount/1e3)))) +
  #coord_flip() +
  labs(x = "Awarded Amounts (in $1000)",
       y = "Density") +
  theme_bw() +
  theme(legend.position="none",
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1))

density

Show the code
# Save the plot to a file. You can change the file name and the path to the file.

ggsave("data/density_normal.png", density, dpi = 300)

Awarded Amounts Option 2

If your code is skewed, as in this example, you can use the following code which takes the log of the awarded amounts. Make sure to delete Option 1 then.

Show the code
density_log <- awards_g |>
  ggplot(aes(round(amount/1e3))) +
  geom_density() +
  scale_x_continuous(trans='log2', 
                     breaks = c(2^seq(0, ceiling(log2(max(round(awards_g$amount/1e3))))), round(mean(round(awards_g$amount/1e3)),0), round(median(round(awards_g$amount/1e3)),0))) +
  geom_vline(xintercept = median(round(awards_g$amount/1e3)), color = "#450084") +
  geom_vline(xintercept = mean(round(awards_g$amount/1e3)), color = "#CBB677") +
  geom_text(aes(x = median(round(amount/1e3)), y = Inf, label = "Median"), color = "#450084", vjust=1,hjust=2.1,size=3,angle=90) +
  geom_text(aes(x = mean(round(amount/1e3)), y = Inf, label = "Mean"), color = "#CBB677", vjust=1,hjust=2.1,size=3,angle=90) +
  labs(x = "Awarded Amounts in $1000 (log transformed)",
       y = "Density") +
  theme_bw() +
  theme(legend.position="none",
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1))


density_log

Show the code
# Save the plot to a file. You can change the file name and the path to the file.

ggsave("data/density_log.png", density_log, dpi = 300)

This is text and inline code. Feel free to change regular text. If you want to change the inline code, which begins with “`r”, followed by R code and ends with “`.”

The average of filtered NSF Division of Social and Economic Sciences (SBE/SES) projects is $438,731.6. The median amount is $366,072.

Abstract Analysis

Word Frequency in Abstracts

Here we analyze the abstracts of the awarded projects. We first remove duplicate titles/ abstracts.

Show the code
# Remove duplicate titles
awards_abstracts <- awards |> distinct(title, .keep_all = TRUE)
Important

We will remove common stopwords (e.g., this, that). We will also remove custom stopwords. You can add or remove custom stopwords within quotation marks separated by commas as you see fit.

Show the code
custom_stopwords <- c("will", "can", "br", "s", "g", "e", "(anr).")

We will remove extra whitespace, convert to lowercase, remove punctuation, remove numbers, remove custom stopwords, remove default stopwords, and remove extra whitespace again.

Show the code
awards_abstracts$abstract <- awards_abstracts$abstract |> 
  str_squish() |>  # Remove extra whitespace
  tolower()         # Convert to lowercase
  

text_corpus <- Corpus(VectorSource(awards_abstracts$abstract)) |> 
  # tm_map(content_transformer(function(x) gsub("</>|<br/>", " ", x))) |> # Remove HTML tags
  tm_map(removePunctuation) |> # Remove punctuation
  tm_map(removeNumbers) |> # Remove numbers
    tm_map(removeWords, custom_stopwords) |> # Remove custom stopwords
  tm_map(removeWords, stopwords("english")) |> # Remove default stopwords
  tm_map(stripWhitespace) # Remove extra whitespace

tdm <- TermDocumentMatrix(text_corpus) # Create a term-document matrix
word_freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE) # Sort terms by frequency
tdm_df <- data.frame(word = names(word_freqs), freq = word_freqs) # Create a data frame for the word frequencies

freq <- ggplot(tdm_df[1:20,], aes(x = reorder(word, freq), y = freq)) + # Plot top 20 most frequent words
  geom_bar(stat = "identity") +
  coord_flip() + # Flip axes for better readability
  labs(x = "Terms", y = "Count", title = "Most Common Words in Abstracts") +
  ggeasy::easy_center_title() # Center the title in the plot

freq

Show the code
ggsave("data/freq.png", freq, dpi = 300)

Topic Modeling in Abstracts

We will use Latent Dirichlet Allocation (LDA) for topic modeling. Here, all you need to do is decide how many topics you want to extract. I will use 4 topics in this example. You can change the number of topics by changing the value of “k” in the LDA function. Play around with it to see what works best for your data.

Show the code
dtm <- DocumentTermMatrix(text_corpus)

# Remove empty documents from the dtm
row_totals <- rowSums(as.matrix(dtm))  # Compute row-wise term sums
dtm <- dtm[row_totals > 0, ]  # Keep only non-empty rows


lda <- LDA(dtm, k = 4, control = list(seed = 1234))
 
topics <- tidy(lda, matrix = "beta")

 
top_terms <- topics |>
  group_by(topic) |>
  slice_max(beta, n = 10) |>
  ungroup() |>
  arrange(topic, -beta)

tm <- top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  theme(axis.text.y = element_text(size = 12),
      axis.text.x = element_text(size = 12),
      axis.title = element_text(size = 12))

tm

Show the code
ggsave("data/tm.png", tm, dpi = 300)

Word Networks (Bigrams) in Abstracts

We will create a word network (bigram: two words) from the abstracts. You can change the bigrams to trigrams (three words) or more by changing the value of “n” in the unnest_tokens function. I will use bigrams in this example. You can change the value of “n” to whatever you want. I will also filter out common stopwords and custom stopwords we previously defined.

Show the code
awards_abstracts <- awards_abstracts |>
  select(title, abstract) # Select only the title and abstract columns
  
bigrams <- awards_abstracts |> 
  unnest_tokens(bigram, abstract, token = "ngrams", n = 2) |> # Tokenize into bigrams
  separate(bigram, c("word1", "word2"), sep = " ") |> # Split bigrams into two words
  filter(!word1 %in% c(stopwords("english"), custom_stopwords),
         !word2 %in% c(stopwords("english"), custom_stopwords)) # Remove stopwords from bigrams

bigram_counts <- bigrams |> count(word1, word2, sort = TRUE) # Count bigram frequencies

We will create a graph from the bigrams. You can change the threshold for filtering the bigrams by changing the value of “n” in the filter function. I will use 20 in this example to include only the bigrams that appear more than 20 or more times. You can change it to whatever you want.

Show the code
bigram_graph <- bigram_counts |> filter(n >= 20) |> # Filter to include only bigrams with frequency > 40
  igraph::graph_from_data_frame() # Create a graph from the bigram data frame

bigram <- ggraph(bigram_graph, layout = "fr") + # Create a force-directed graph of bigrams
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1, repel = TRUE) # Add labels to nodes with repelling effect

bigram

Show the code
ggsave("data/bigram.png", bigram, dpi = 300)

Abstract Takeaways

Write your analysis based on above text analysis here.

AI Summary of Awarded Proposals

You can export the abstracts to a csv file, and then use AI tools to summarize and categorize the abstracts.

Show the code
# exporting the abstract data to a csv file.

write_csv(awards_abstracts, "data/awards_abstracts.csv")

Open the created csv file (awards_abstract file in the data folder) in Excel or another program, and save it as a .pdf file.

Then go to NotebookLM, and upload the pdf file. You can then use the AI to summarize and categorize the abstracts.

You can also use other AI tools for this task.

Rendering

After you finish editing the Quarto document, you can render it to an .html file (or .pdf, .docx, .pptx – but you need much more customization for that).

You can do this by clicking on the “Render” button in the top center of the RStudio window (blue right-facing arrow icon). You can also use the keyboard shortcut CTRL+Shift+K (or Command+Shift+K on Mac) to render the document.

Congratulations

Congratulations! You have successfully analyzed past award data using R and RStudio. You can now use this knowledge to analyze your own past award data. If you have any questions or need help, feel free to reach out to me (after you consult with your AI assistant(s) first!).